Abstract

The vocabulary of a continuous speech recognition (CSR) 
system is a significant factor in determining its performance. 
In this paper, we present three principled approaches to select 
the target vocabulary for a particular domain by trading off be- 
tween the target out-of-vocabulary (OOV) rate and vocabulary 
size. We evaluate these approaches against an ad-hoc baseline 
strategy. Results are presented in the form of OOV rate graphs 
plotted against increasing vocabulary size for each technique. 

1. Introduction 

The size and performance of a language model or speech recog- 
nition system are often strongly influenced by the size of its vo- 
cabulary. Ideally, the vocabulary is small, allowing us to build 
compact language models, and it is matched to the target do- 
main so that as many as possible of the domain-specific words 
are known to the recognition system. Compact language mod- 
els generate compact word graphs that are efficient to search and 
domain-matched vocabularies result in fewer out-of-vocabulary 
(OOV) words and consequently fewer recognition errors. In 
a study of the effect of OOV words on the Word Error Rate 
(WER) of a recognition system, Rosenfeld [1] arrives at a figure
of about 1.2 WER points per OOV word in a typical large
vocabulary task. 

While a large and comprehensive vocabulary may be desir- 
able from the point of view of lexical coverage, we often set- 
tle for smaller and more tractable vocabularies. Not only are 
large vocabulary language models themselves very large, but 
for speech recognition systems, there is also the additional cost 
and effort involved in determining accurate pronunciations for 
every vocabulary entry. Even with the help of tools to generate 
pronunciations and check consistency of entries, this is a
difficult task [2].

Furthermore, there is also the attendant problem of in- 
creased acoustic confusability for speech recognition systems 
when the vocabulary is large [1]. For applications requiring
a finite vocabulary, picking the right words for the vocabulary 
is especially important for achieving satisfactory performance. 
Usually, a number of text corpora from various domains and 
time periods are available on which to train. The target domain 
is known, and the amount of data available in the target domain 
is far less than in any of the training corpora. Clearly, restricting 
the vocabulary to just the words that are observable in a mea- 
ger amount of available domain data would be disastrous. On 
the other hand, including the union of the vocabularies of all 
the training corpora would be intractable. What we want in this 
situation is to assume that the vocabulary of the target domain 
is somehow related to the vocabularies of each training corpus, 
and subsequently infer the target vocabulary from the individual
training corpus vocabularies, considering the observable portion
of the domain text to be a sample. 

Even though vocabulary selection is an important issue and 
the problem appears to be simple, little work exists on this topic 
to date. The most common approaches seem to be ad hoc in 
nature, typically including words from each corpus that exceed 
some threshold frequency. This threshold depends on intuitions 
about the relevance of the corpus to the target domain [3]. In
Rosenfeld's 1995 work [1] on optimizing vocabularies, attention
was mainly directed at determining the effect on the OOV
rate of corpus recency, size and origin. While it was found 
that all three factors strongly affected the OOV rate, no specific 
guidelines were proposed as to how to combine the vocabular- 
ies from these different corpora to choose the target vocabu- 
lary. Indeed, Rosenfeld remarks that an ad hoc approach that 
discounted words by 1% for every week of age of the corpus 
reduced the OOV rate only very slightly for vocabulary sizes in 
the range of 20,000 to 50,000 words. 

The paucity of work on this important topic can partly be 
attributed to the general observation due to Zipf [4] that with
even a moderate sized vocabulary chosen wisely, one can hope 
to get significant lexical coverage. Yet it is desirable from the 
point of view of scalability, extensibility and generality to study 
principled methods to address this problem. In this paper, we 
propose three such principled methods. The goal is to select a 
single vocabulary from many corpora of varying origins, sizes 
and recencies such that the vocabulary is optimized for both 
size and the OOV rate in the target domain. Section 2 defines
the problem. Section 3 describes the proposed techniques, and
Section 4 presents the results.

2. Problem Description 

The vocabulary selection problem can be briefly summarized 
as follows. We wish to estimate the true vocabulary counts of 
a partially visible corpus of in-domain text (which we call the 
held-out set) when a number of other fully visible corpora, pos- 
sibly from different domains, are available on which to train. 
There is an implicit assumption that the held-out text is related 
to the training text and the learning task amounts to inferring 
this relation. The reason for learning the in-domain counts $x_i$
of words $w_i$ is so that the words may be ranked in order of
priority, enabling us to plot a curve relating a given vocabulary
size to its OOV rate on the held-out corpus. Therefore, it is
actually sufficient to learn some monotonic function $f(x_i)$ in
place of the actual $x_i$. We may assume that the counts are normalized
by document length so that the amount of available data for a 
particular corpus is itself irrelevant to the issue at hand. 
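
To make the step from a word ranking to an OOV curve concrete, the
following Python sketch computes the OOV rate of the top-ranked words
at several vocabulary sizes. The function name `oov_curve` and the toy
data are illustrative assumptions and not part of the experimental
setup; the ranking is assumed to contain each word only once.

```python
from collections import Counter


def oov_curve(ranked_words, heldout_tokens, sizes):
    """Return (size, OOV rate) pairs for vocabularies of the top-ranked words."""
    counts = Counter(heldout_tokens)
    total = sum(counts.values())
    # Position of each word in the ranking; a lower index means higher priority.
    rank = {word: r for r, word in enumerate(ranked_words)}
    curve = []
    for size in sizes:
        # A token is covered if its word is among the first `size` ranked words.
        covered = sum(c for w, c in counts.items() if rank.get(w, size) < size)
        curve.append((size, 1.0 - covered / total))
    return curve


# Toy data (hypothetical): a five-word ranking and a seven-token held-out text.
ranked = ["the", "news", "today", "market", "rain"]
tokens = "the news today the weather the market".split()
for size, rate in oov_curve(ranked, tokens, [1, 2, 4]):
    print(size, round(rate, 3))
```

Only the order of the ranked words matters here, which is why any
monotonic function of the counts gives the same curve.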

Table 1 illustrates the problem; $n_{i,j}$ are the visible counts
from each of the documents $j$ for the word $w_i$, and the $x_i$ are
the incomplete counts for words $w_i$ in the partially observable
held-out text.

Table 1: Problem illustration. We wish to estimate some
monotonic function of the true counts $x_i$ for word $w_i$ in the partially
observed domain text based on a number of fully observed
out-of-domain counts $n_{i,j}$.

Let $x_i$ be some function $\Phi_i$ of the known counts $n_{i,j}$ for
$1 \le j \le m$ for each of the $m$ corpora. Then, the problem
can be restated as one of learning the $\Phi_i$ from a set of examples
where

$$x_i = \Phi_i(n_{i,1}, \ldots, n_{i,m})$$

In the following section, we summarize three techniques
for learning the $\Phi_i$ that optimize the vocabulary for the domain
from which the held-out data was drawn.

3. Method 

For simplicity, let the $\Phi_i$ be linear functions of the $n_{i,j}$, and
let them be independent of the particular words $w_i$. That is,
$\Phi_i = \Phi$ for all $i$. Then, we can write

$$x_i = \sum_{j=1}^{m} \lambda_j n_{i,j} \tag{1}$$

The problem transforms into one of learning the $\lambda_j$. We
now outline three methods to do this. The first is based on
maximum likelihood (ML) count estimation; the second and third
are based on document similarity measures. We evaluate each
of these three methods against a fourth baseline method that
simply assigns identical values to all the $\lambda_j$.
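
The linear combination of per-corpus counts, and the ranking it
induces, can be sketched in Python as follows. The function names
and the two toy corpora are assumptions made for illustration; the
uniform weights correspond to the baseline that assigns identical
values to all the weights.

```python
def interpolate_counts(norm_counts, weights):
    """Combine per-corpus normalized counts into one score per word.

    norm_counts: one dict per corpus mapping word -> normalized count.
    weights: one interpolation weight per corpus.
    """
    scores = {}
    for corpus, lam in zip(norm_counts, weights):
        for word, n in corpus.items():
            scores[word] = scores.get(word, 0.0) + lam * n
    return scores


def rank_vocabulary(scores):
    """Order words by descending score; ties are broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))


# Toy normalized counts (hypothetical) for two training corpora.
newswire = {"market": 0.5, "the": 0.4, "stocks": 0.1}
captions = {"the": 0.5, "weather": 0.3, "market": 0.2}

uniform = interpolate_counts([newswire, captions], [0.5, 0.5])
print(rank_vocabulary(uniform))

skewed = interpolate_counts([newswire, captions], [0.1, 0.9])
print(rank_vocabulary(skewed))
```

A vocabulary of size k is then simply the first k words of the
ranking, so changing the weights changes which words make the cut.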

3.1. Maximum likelihood count estimation 

In ML count estimation, we simply interpret the normalized
counts $n_{i,j}$ as probability estimates of $w_i$ given corpus $j$ and
the $\lambda_j$ as mixture coefficients for a linear interpolation. We
try to choose the $\lambda_j$ that maximize the probability of the
in-domain corpus. Formally, let $P(w_i \mid j) = n_{i,j}$. Our goal is to find

$$\operatorname*{argmax}_{\lambda_1, \ldots, \lambda_m} \prod_{i=1}^{|V|} \Bigl( \sum_{j=1}^{m} \lambda_j P(w_i \mid j) \Bigr)^{C(w_i)} \tag{2}$$

where $C(w_i)$ is the count of $w_i$ in the partially observed held-out
corpus and $V$ is the set of words in the vocabulary. The
$\lambda_j$ can subsequently be estimated via the EM algorithm [5] and
used to calculate the interpolated normalized counts. The
procedure shown in Figure 1, for instance, is effective in rapidly
computing the values of the $\lambda_j$.
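
As an illustration, the textbook EM update for mixture weights can be
sketched in Python as below. This is a minimal sketch and not
necessarily the exact procedure of the figure referenced above: the
function name, the uniform initialization, the stopping tolerance and
the toy data are all assumptions, and held-out words that receive
zero probability from every corpus are simply skipped.

```python
def em_mixture_weights(heldout_counts, corpus_probs, iterations=100, tol=1e-8):
    """Estimate interpolation weights that maximize held-out likelihood.

    heldout_counts: dict word -> count in the held-out text.
    corpus_probs: one dict per corpus mapping word -> probability.
    """
    m = len(corpus_probs)
    lam = [1.0 / m] * m  # assumed starting point: uniform weights
    for _ in range(iterations):
        expected = [0.0] * m
        for word, count in heldout_counts.items():
            parts = [lam[j] * corpus_probs[j].get(word, 0.0) for j in range(m)]
            denom = sum(parts)
            if denom == 0.0:
                continue  # unseen in every corpus: carries no weight information
            for j in range(m):
                # E-step: posterior share of corpus j for this word, times its count.
                expected[j] += count * parts[j] / denom
        norm = sum(expected)
        if norm == 0.0:
            return lam  # nothing to learn from; keep the current weights
        # M-step: renormalize the expected counts into new weights.
        new_lam = [e / norm for e in expected]
        change = max(abs(a - b) for a, b in zip(lam, new_lam))
        lam = new_lam
        if change < tol:
            break
    return lam


# Toy data (hypothetical): the held-out text resembles the second corpus.
newswire = {"market": 0.5, "the": 0.4, "stocks": 0.1}
captions = {"the": 0.5, "weather": 0.3, "market": 0.2}
heldout = {"the": 5, "weather": 3, "market": 2}
print(em_mixture_weights(heldout, [newswire, captions]))
```

In this toy case the held-out counts match the second corpus exactly,
so the weight of the first corpus shrinks toward zero with each
iteration.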

Figure 1: [...] reestimated at each iteration until a convergence
criterion determined by some threshold of incremental change is met. The
likelihood of the held-out corpus increases monotonically until
a local maximum has been reached.

3.2. Document-similarity-based count estimation

The document-similarity-based count estimation method
calculates interpolation weights from similarity measures between
the held-out corpus and each of the training corpora. This
similarity measure can presumably be calculated using any of a
number of methods ranging from a simple Euclidean distance
metric to a more sophisticated divergence measure between the
observable probability distributions, such as Kullback-Leibler
(KL) [6] or a symmetric variant [7].

The Euclidean distance metric is calculated as follows.
Suppose we represent each document by a vector of its
normalized word counts. Then, the Euclidean distance between two
corpora $C_a$ and $C_b$, $\Delta(C_a, C_b)$, is given by

$$\Delta(C_a, C_b) = \sqrt{\sum_{i} \left( n_{a,i} - n_{b,i} \right)^2} \tag{7}$$

where $n_{a,i}$ and $n_{b,i}$ are the normalized counts of $w_i$ in the
corpora $a$ and $b$, respectively.
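
A minimal Python sketch of this distance is given below. It assumes
that each corpus is represented as a dict of normalized counts and
that a word missing from a corpus has a count of zero; the function
name and the toy vectors are illustrative.

```python
import math


def euclidean_distance(counts_a, counts_b):
    """Euclidean distance between two normalized word-count vectors."""
    # Sum over the union of both vocabularies; absent words count as zero.
    vocab = set(counts_a) | set(counts_b)
    squared = sum(
        (counts_a.get(w, 0.0) - counts_b.get(w, 0.0)) ** 2 for w in vocab
    )
    return math.sqrt(squared)


# Toy vectors (hypothetical) that differ in one word each.
a = {"the": 0.6, "news": 0.4}
b = {"the": 0.6, "rain": 0.4}
print(euclidean_distance(a, b))
```

Because the sum runs over the whole vocabulary, every word that is
absent from one of the two corpora contributes to the distance.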

Likewise, the KL-divergence, which we again denote as
$\Delta(C_a, C_b)$ for the sake of uniformity, is given by

$$\Delta(C_a, C_b) = \sum_{i} P(a,i) \log_2 \frac{P(a,i)}{P(b,i)} \tag{8}$$

where we interpret the normalized word counts as probabilities.
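
The divergence can be sketched in Python as follows. The sketch
assumes that both distributions are already smoothed, so that the
second one assigns nonzero probability to every word the first one
does; the function name and the toy distributions are illustrative.

```python
import math


def kl_divergence(p_a, p_b):
    """KL-divergence in bits between two word distributions given as dicts."""
    total = 0.0
    for word, pa in p_a.items():
        if pa == 0.0:
            continue  # a zero-probability term contributes nothing
        pb = p_b.get(word, 0.0)
        if pb == 0.0:
            # Assumption of this sketch: inputs are smoothed beforehand.
            raise ValueError(f"no probability mass for {word!r} in second corpus")
        total += pa * math.log2(pa / pb)
    return total


# Toy distributions (hypothetical).
p = {"x": 0.5, "y": 0.5}
q = {"x": 0.25, "y": 0.75}
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which shows that the measure is not
symmetric in its arguments.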

In each of the above distance calculation schemes, let $D_j$
be the distance of the $j$th corpus from the held-out domain text.
Then, since the relevance of a corpus to the domain is inversely
related to its distance from the domain, we define the interpolation
weights $\lambda_j$ so that they decrease as $D_j$ increases.
3.3. Data sources and implementation

The experimental setup consisted of learning the optimal
vocabulary to model the language of English broadcast news. A
small amount of hand-corrected closed captioned data, amount- 
ing to just under 3 hours (about 25,000 words), drawn from 
six half-hour broadcast news segments from January 2001, was 
used as the partially visible held-out data to estimate the two 
mixture weights $\lambda_1$ and $\lambda_2$. This held-out data is part of the
corpus released by the Linguistic Data Consortium (LDC) for the
National Institute of Standards and Technology (NIST) spon- 
sored English topic detection and tracking (TDT4) task. 

The training corpora were deliberately chosen to be as dif- 
ferent from each other in character as possible. The first corpus 
consisted of about 18.5 million words of English newswire data 
covering the period July 1994 through July 1995, and was
distributed by the LDC for the NIST-sponsored Hub3 1995
continuous speech recognition task. It contained text from The NY
Times News Service, LA Times, Washington Post News Service,
Wall Street Journal and Reuters North American Business
News. The second training corpus consisted of a closer match 
to the target domain and came from segments of the TDT4 
dataset released by the LDC. This consisted of about 2.5 million 
words of closed captioned transcripts from the period Novem- 
ber through December 2000. 

Unigram counts for the training and held-out corpora were
generated using language modeling tools from the SRILM toolkit [8]
with Witten-Bell [9] smoothing. Estimation of the $\lambda_j$ was
performed on five of the six held-out segments, which we
collectively refer to as the development corpus, and OOV rates were
measured on the remaining segment, which we refer to as the 
test corpus. This procedure was repeated six times, once for each
possible split of the held-out data. The results we present are
averaged numbers obtained from the six splits. Where applicable,
we use the subscripts "hub3" and "tdt4" to refer to parameters 
specific to the above corpora. 
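
The splitting scheme can be sketched in Python as follows. The
function names and segment labels are hypothetical, and the
per-split measurement is passed in as a callable, since the sketch
makes no assumption about how the OOV rate itself is computed.

```python
def leave_one_out_splits(segments):
    """Yield (development, test) pairs, holding out each segment once."""
    for k, test in enumerate(segments):
        development = segments[:k] + segments[k + 1:]
        yield development, test


def averaged_over_splits(segments, measure):
    """Average a caller-supplied measurement over all leave-one-out splits."""
    values = [measure(dev, test) for dev, test in leave_one_out_splits(segments)]
    return sum(values) / len(values)


# Six hypothetical segment labels standing in for the held-out segments.
segments = ["seg1", "seg2", "seg3", "seg4", "seg5", "seg6"]
for development, test in leave_one_out_splits(segments):
    print(test, development)
```

Each segment serves as the test corpus exactly once, and the
remaining five form the development corpus for that split.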

4. Results and Discussion 

We examine the results of our experiments to evaluate the
various methods. Figure 2 shows a plot of the OOV rate against
increasing vocabulary size from 1 word to 90,000 words. This
figure, which is plotted on a logarithmic scale, is only meant to
show the general shape of the individual plots and to support
some broad generalizations. For instance, we see confirmation 
of the common observation that the OOV rate of a given vocab- 
ulary on a corpus is logarithmically related to the vocabulary 
size. Furthermore, it is also evident that for small vocabular- 
ies there exist obvious differences in the performances of the 
various vocabulary selection methods. But for large vocabular- 
ies, this difference is seen to diminish. Indeed, for vocabulary 
sizes in excess of about 60,000 words, the four plots practically 
merge into a single line showing that at around that threshold 
and beyond, we capture practically all the words that are likely 
to be used in the domain under consideration, regardless of the 
specific method used to choose the vocabulary. 

For a finer-grained comparison of the individual techniques,
we restrict our attention to the rectangular sub-region in
Figure 2, which is depicted in a separate plot in Figure 3. It
shows the performance of the four systems for a vocabulary
range of 1,000 to 2,000 words. The trend of the curves in this
graph, which continues up to a vocabulary size of around 40,000
words, clearly shows that the ML method outperforms the
other three methods by over 1% absolute. It is also clear that
the method based on KL-divergence is the poorest of all,
performing worse than even the uniform baseline. The
Euclidean-distance-based method performs almost identically to
the uniform baseline (and thus the plot for the latter, being
almost hidden behind that of the former, is barely visible).

In hindsight, the relatively good performance of the 
maximum-likelihood-based method is not very surprising be- 
cause it is the only method that does not look beyond the de- 
velopment corpus vocabulary to compute its objective func- 
tion. Both the KL-divergence-based method and the Euclidean- 
distance-based method sum quantities over the entire vocabu- 
lary and are therefore affected by the values held by individ- 
ual words that were not seen in the development corpus. This 
problem is especially acute because the actual vocabulary of the 
partially visible development corpus is typically tiny compared 
to the vocabularies of the training corpora. The KL-divergence- 
based method is affected most by this situation. Because KL- 
divergence involves calculation of log-probabilities, the method 
is extremely sensitive to the amount of probability mass devoted
to unseen vocabulary items and consequently to the particular
form of smoothing employed. Since a significant number of
words in the vocabulary are typically unseen in the development
corpus, these end up with very low unigram probabilities. Thus,
in summing over the entire vocabulary, large negative numbers
come into play which overshadow any significant contribution
to the total divergence by the unigrams observed in the
development corpus. We suspect therefore that we must not attach
much significance to the final quantity computed by this method
unless the size of the development corpus itself is substantial.

Figure 2: Averaged OOV rate across the six test corpora when
the vocabulary was determined by each of the four methods
described in this paper. This plot is meant for the purpose of
depicting the general trend. Expansions of the rectangular
enclosure in a subsequent plot will serve as a more detailed point of
discussion.

The Euclidean method is likewise affected, but to a
lesser degree and slightly differently. The computed distances
tend to be dominated by words that are absent in the
development corpus rather than by words that are present in it. Since
the absent words form the bulk of the vocabulary, the distances
computed between the various corpora and the development
text, and consequently the $\lambda_j$, will all be roughly the same, as
evidenced by the figures in Table 2.

5. Conclusions 

We have outlined three general techniques to select an optimal 
vocabulary for domain-specific speech and language modeling 
tasks. The techniques are scalable to arbitrarily large
corpora and extensible to any number of corpora. Whenever
reasonable amounts of training data and reliable unigram count
estimates are available, we believe that the
maximum-likelihood-based method we have described is a robust way
to select a domain's vocabulary, especially when its size is
expected to be under a certain threshold. This threshold can be expected to
vary between domains and it is possible that when it is high, the
choice of any particular strategy over another does not matter.
However, we believe that always following a principled strategy
to select the vocabulary offers the safest path.

Figure 3: Averaged OOV rate across the six test corpora when
the vocabulary was determined by each of the four methods
described in this paper. The plot shows the segment of the OOV
rate curve for a vocabulary size in the range of 1,000 to 2,000
words.

Table 2: Inferred interpolation weights $\lambda_j$, along with the
normalized corpus distances from the domain text for the
distance-based methods. All figures are averaged across all six splits of
the test data.

We plan to continue to refine and evaluate the techniques 
presented in this paper and apply them for vocabulary selection 
in the English broadcast news recognition task of the NIST 2003 
Hub4 evaluation. 